Introduction to radiomics

Medical images are essential for studying diseases. For example, to examine how a patient’s tumor evolves, we perform multiple scans over time. Interpreting how the tumor has changed by looking at the images is not an easy task, and different doctors may predict different outcomes. This is where computers become very useful. Thanks to artificial intelligence techniques, we can show computers different medical images over time of a patient with a tumor, and they can identify features of these images to make predictions. They can help decide whether a treatment for a tumor is working or not, decide which alternative treatment to use, and much more, without having to wait for the tumor to get worse.

In the last decades, several areas of human activities have experienced an increase in digitization. In the case of medicine, a significant amount of information generated during clinical routine has been digitized. With this digital increase, new and better software has been developed to analyze the data. Thanks to research in Artificial Intelligence, these methods have become very powerful and available to any user, allowing doctors to use them on a daily basis.

As the amount of data increases, different Artificial Intelligence techniques - mainly Machine Learning and Deep Learning - are of high utility to deal with this large amount of data, an area known as “Big Data”. In simple terms, Big Data refers to sets of data whose size, complexity and speed of growth make it difficult to analyze using traditional tools.

Radiomics is a field of medical study whose purpose consists of extracting a large volume of features from medical images using data characterization algorithms. These features are known as radiomic features and can help to discover tumor patterns that are difficult for the human eye to analyze. Radiomics is a relatively new scientific filed. The first radiomics studies appeared in PubMed as recently as 2011.

It is believed that, in the end, radiomics will use specific treatments for each patient, will help doctors select the most appropriate treatment for each patient, and will be able to change treatments quickly if they do not work.

Some of the difficulties when performing radiomics research is that high quality images, with adequate size and complete datasets are needed. Different training and validation datasets are also necessary, to check if our algorithm works correctly. Another difficulty is class imbalance (classification problem where the number of observations per class is not equally distributed) and overfitting (when a statistical model exactly fits its training data). The main clinical applications of radiomics are radiogenomics and clinical outcome prediction.

Process of radiomics

Now we will briefly explain the process of radiomics.

  1. Obtain image: first, we obtain a medical image from a scanner (it can be obtained from multiple modalities: magnetic resonance imaging (MRI), computed tomography (CT), positron-emission-tomography (PET)…).
  2. Image segmentation: this means dividing the image into multiple segments, in other words, delineating the areas of interest in the image in terms of pixels or voxels. This step can be done manually, semi-automatically or fully automatically.
  3. Feature extraction: after image segmentation, we can extract the features and classify them.
  4. Finally, we use databases to share our data.
DICOM images

In this project, we will study images obtained from scans of the lungs. The images we will obtain from the scans are DICOM images. DICOM (Digital Imaging and Communications in Medicine) is the standard for the communication and management of medical imaging information and related data. It is used worldwide to store, exchange, and transmit medical images. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use. DICOM incorporates standards for imaging modalities such as radiography, ultrasonography, computed tomography (CT), magnetic resonance imaging (MRI), and radiation therapy. The most common applications of this standard are the display, storage, printing, and transmission of images.

Radiomic techniques

To obtain the features, we can use multiple techniques. Radiomic techniques can be divided into four categories: intensity-based metrics, texture-based analysis, shape-based measures, and transform-based metrics. We will briefly discuss these techniques now.

  • Intensity-based metrics refers to statistics that are calculated from pixel values (or volumetric pixels called voxels). Additional information that can be obtained from analyzing the relationship between the voxels it is not considered. To compute the statistics, we select a region and extract the voxel values. They can be analyzed with histogram analysis. To quantify different aspects of the distribution we use average and variation, shape, and diversity.

  • Texture-based analysis refers to the analysis of image patterns, such as intensity, shape, or colors. Mathematical formulas based on the spatial relationship of voxels are used to quantify these concepts.

  • Shape-based measures refer to the study of different components of a structure. They can be divided into 1D metrics (measurement of the distance between two points. They are used to describe the magnitude of an abnormality), 2D metrics (calculated on cross-sectional planes and are used to calculate different parameters that are based on areas) and 3D metrics (attempt to enumerate different aspects of volumetric shape).

  • Transform-based metrics refers to the transformation of images from spatial coordinates to what is called a frequency domain, without losing any information.

Radiomic features

We can obtain multiple types of features from images. Qualitative features are used to describe lesions, and quantitative features are extracted from images using computer programs that apply mathematical algorithms. Now, we will focus on quantitative features, which can be divided into different subgroups.

  • Shape features describe the shape of the traced region of interest and its geometric properties like volume, maximum diameter along different orthogonal directions, maximum surface, tumor compactness, etc.

  • First-order statistics features describe the distribution of individual voxel values without taking into account spatial relationships. Some properties we obtain are the mean, median, maximum, minimum values of the voxel intensities on the image, skewness, kurtosis, etc.

  • Second-order statistics features include the textural features. They are obtained computing the statistical relationships between neighboring voxels.

  • Higher-order statistics features are obtained by statistical methods after applying filters or mathematical transforms to the images. Some examples are fractal analysis, Minkowski functionals or Laplacian transforms of Gaussian-filtered images.

Image display (oro.dicom)

In this section we will display a few images using R.

The R package oro.dicom is a collection of input/output functions for medical imaging data that conform to the Digital Imaging and Communications in Medicine (DICOM) standard. The R package RIA is another package that was developed to facilitate radiomic analysis of medical images.

First we load the libraries oro.dicom and RIA.

library(oro.dicom)
library(RIA)

Now we will display 3 images from the same patient (ID00007637202177411956430). To read the images we use readDICOMFile().

image1 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/11/11.dcm")
image(t(image1$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")

image2 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/12/12.dcm")
image(t(image2$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")

image3 <- readDICOMFile("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/13/13.dcm")
image(t(image3$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")

Radiomic features without masks (RIA)

To compute the radiomic features of the images using RIA, we first need to convert the DICOM images to RIA_image class. To do so, we will use the load_dicom() function.

1 image

First, we compute the radiomic features of 1 image without a mask (we will do it for the first image displayed earlier). To compute the first-order statistics we use the first_order function.

DICOM <- RIA::load_dicom(filename="/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00007637202177411956430/11", skipFirst128=FALSE, DICM=FALSE, boffset = 128)
DICOM = first_order(RIA_data_in = DICOM) # first order statistics
first_order <- RIA:::list_to_df(DICOM$stat_fo$orig) # list of first order statistics
name_characteristics <- rownames(first_order) # names of the first order statistics
library(rmarkdown)
paged_table(first_order)
10 images

Now we will compute the first order characteristics of 10 images (images 10 through 19) from the same patient (ID00007637202177411956430) without a mask.

first_order <- c()
res <- c()
first_image <- 10 # first image to study
last_image <- 19 # last image to study
for (i in first_image:last_image) { # for loop through all the images
  path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", "ID00007637202177411956430", i, fsep="/") # path of every image
  DICOM <- RIA::load_dicom(filename=path, skipFirst128=FALSE, DICM=FALSE, boffset = 128) # load dicom image
  DICOM = first_order(RIA_data_in = DICOM, use_type = "single") # compute first order characteristics
  first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  res[i-first_image+1] <- first_order[i] # data.frame counts from 1 to 10 (not first_image to last_image)
} 

Finally, we combine all the results into a data.frame called ‘results’.

library(rmarkdown)
results <- do.call(cbind.data.frame, res)
colnames(results) <- seq(1, last_image-first_image+1) 
rownames(results) <- name_characteristics
#results:  results in a data.frame
paged_table(results, options = list(rows.print = 10, cols.print = 5))

Segmentation (lungct)

Now we will perform segmentation of the images. Recall that segmentation refers to delineating the areas of interest in the image in terms of pixels or voxels, in our case, the lung. We will use the patient ID00184637202242062969203.

We will use the library in R lungct. The lungct R package develops an image processing pipeline for computed tomography (CT) scans of the lungs.

First, we load the libraries needed to do the segmentation.

library(lungct) # to perform segmentation
library(dcm2niir) # to convert DCM to NIFTI
library(ANTsRCore)

Now we have to convert the data from DCM to NII. NIFTI (Neuroimaging Informatics Technology Initiative) is a data format for the storage of Functional Magnetic Resonance Imaging (fMRI) and other medical images.

res <- dcm2niir::dcm2nii(basedir="/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203")
checked <- check_dcm2nii(res) # Manipulating the output

Next we plot some images.

image <- antsImageRead(checked)
plot(image)

We create a mask.

mask <- segment_lung(image)

And finally, we plot the mask.

plot(mask$lung_mask)

Plot 1 image

res2 <- dcm2niir::dcm2nii(basedir="/Users/andrealetaalfonso/Desktop/TFG/images/img/Folder_images/15")
checked2 <- check_dcm2nii(res2) 
image2 <- antsImageRead(checked2)
plot(image2)

Mask of 1 image

mask2 <- segment_lung(image2)
plot(mask2$lung_mask)

Error

Radiomic features with masks (lungct)

In this section we will compute the radiomic features of 10 images with a mask.

first_order <- c()
for (i in 1:10) { # for loop through all the images
  path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "img", "Folder_images", i, fsep="/") # path of every image
  DICOM <- RIA::load_dicom(filename=path, mask_filename=path, skipFirst128=FALSE, DICM=FALSE, boffset = 128) # load dicom image + mask
  DICOM = first_order(RIA_data_in = DICOM, use_type = "single") # compute first order characteristics
  first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
} 
results2 <- do.call(cbind.data.frame, first_order)
colnames(results2) <- seq(1, 10)
rownames(results2) <- name_characteristics
results2 # results in a data.frame

With a mask we obtain different results, because now we are computing the radiomic features of the lung only, not the whole images

Let’s compare the radiomic features of the first image without a mask and with a mask.

data.frame(results[1], results2[1])

As we can see, we obtain different results.

Radiomic features with masks with multiple patients (RIA, lungct)

Now we will create multiple .txt files with the results of the radiomic features of different patients, using a mask.

The IDs of the patients are the following.

patient1 <- "ID00007637202177411956430"
patient2 <- "ID00009637202177434476278"
patient3 <- "ID00010637202177584971671"
patient4 <- "ID00011637202177653955184"
patient5 <- "ID00012637202177665765362"
patient6 <- "ID00014637202177757139317"
patient7 <- "ID00015637202177877247924"
patient8 <- "ID00019637202178323708467"
patient9 <- "ID00020637202178344345685"
patient10 <- "ID00023637202179104603099"
vector_patients <- c(patient1, patient2, patient3, patient4, patient5, patient6, patient7, patient8, patient9, patient10) # list of patients ID
for(patient in vector_patients){
  first_order <- c()
  path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", patient, fsep="/") # path of every image
  number_files <- length(list.files(path)) # number of folders (ie. images) in each patient
  for (i in 1:number_files) { # for loop through all the images
    path <- file.path("/Users", "andrealetaalfonso", "Desktop","TFG", "images", "Kaggle", patient, i, fsep="/") # path of every image
    res <- dcm2niir::dcm2nii(basedir=path) # Convert the data from dcm to nii
    checked <- check_dcm2nii(res) # Manipulating the output
    image <- antsImageRead(checked)
    mask <- segment_lung(image) # Lung segmentation
    DICOM <- RIA::load_dicom(filename=path, mask_filename=path, skipFirst128=FALSE, DICM=FALSE, boffset = 128) # load dicom image + mask
    DICOM = first_order(RIA_data_in = DICOM, use_type = "single") # compute first order characteristics
    first_order[i] <- RIA:::list_to_df(DICOM$stat_fo$orig)
  } 
  results <- do.call(cbind.data.frame, first_order)
  colnames(results) <- seq(1, number_files)
  rownames(results) <- name_characteristics
  write.table(results, row.names = TRUE, col.names=NA, file=paste0(patient,".txt"), sep="\t") # table with the results
}

Using Python for segmentation

Import Python libraries.

import numpy as np # pip3 install numpy
import pandas as pd # pip3 install pandas
# pip3 install matplotlib
# pip3 install scipy
import skimage # pip3 install scikit-image
import os 
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import dicom # pip3 install dicom
import scipy.misc
import pydicom # pip3 install pydicom
import matplotlib.pyplot as plt

Print list of images from a folder

from subprocess import check_output
path_images = "/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/"
print(check_output(["ls", path_images]).decode("utf8"))

Plot 1 image

# pip3 install nltk==3.6.2
lung = pydicom.read_file("/Users/andrealetaalfonso/Desktop/TFG/images/Kaggle/ID00184637202242062969203/20.dcm")
slice = lung.pixel_array
slice[slice == -2000] = 0
plt.imshow(slice, cmap=plt.cm.gray)
plt.show()

Read all the images from a folder

def read_ct_scan(folder_name):
  # Read the slices from the dicom file
  slices = [pydicom.read_file(folder_name + filename) for filename in os.listdir(folder_name) if not file_is_hidden(filename)]
  # Sort the dicom slices in their respective order
  slices.sort(key=lambda x: int(x.InstanceNumber))
  # Get the pixel values for all the slices
  slices = np.stack([s.pixel_array for s in slices])
  slices[slices == -2000] = 0
  return slices
  
ct_scan = read_ct_scan(path_images) 

Plot some of the images from a folder

def plot_ct_scan(scan):
    f, plots = plt.subplots(int(scan.shape[0] / 20) + 1, 4, figsize=(25, 25))
    for i in range(0, scan.shape[0], 5):
        plots[int(i / 20), int((i % 20) / 5)].axis('off')
        plots[int(i / 20), int((i % 20) / 5)].imshow(scan[i], cmap=plt.cm.bone) 

plot_ct_scan(ct_scan)
plt.show()

Lung segmentation of 1 image

def get_segmented_lungs2(im, plot=False):
    # This funtion segments the lungs from the given 2D slice.
    if plot == True:
        f, plots = plt.subplots()
    # Step 1: Convert into a binary image. 
    binary = im < 604
    # Step 2: Remove the blobs connected to the border of the image.
    cleared = clear_border(binary)
    # Step 3: Label the image.
    label_image = label(cleared)
    # Step 4: Keep the labels with 2 largest areas.
    areas = [r.area for r in regionprops(label_image)]
    areas.sort()
    if len(areas) > 2:
        for region in regionprops(label_image):
            if region.area < areas[-2]:
                for coordinates in region.coords:                
                       label_image[coordinates[0], coordinates[1]] = 0
    binary = label_image > 0
    # Step 5: Erosion operation with a disk of radius 2. This operation is seperate the lung nodules attached to the blood vessels.
    selem = disk(2)
    binary = binary_erosion(binary, selem)
    # Step 6: Closure operation with a disk of radius 10. This operation is to keep nodules attached to the lung wall.
    selem = disk(10)
    binary = binary_closing(binary, selem)
    # Step 7: Fill in the small holes inside the binary mask of lungs.
    edges = roberts(binary)
    binary = ndi.binary_fill_holes(edges)
    if plot == True:
        plots.axis('off')
        plots.imshow(binary, cmap=plt.cm.bone) 
    # Step 8: Superimpose the binary mask on the input image.
    get_high_vals = binary == 0
    im[get_high_vals] = 0
    return im

Image

Mask of the image

get_segmented_lungs2(ct_scan[20], True)
plt.show()